A novel algorithm for identifying low-complexity regions in a protein sequence

Authors

Xuehui Li

Tamer Kahveci

Abstract

Motivation: We consider the problem of identifying low-complexity regions (LCRs) in a protein sequence. LCRs are regions of biased composition, normally consisting of different kinds of repeats.

Results: We define new complexity measures that compute the complexity of a sequence based on a given scoring matrix, such as BLOSUM62. Our complexity measures also take into account the order of amino acids in the sequence and the sequence length. We develop a novel graph-based algorithm, called GBA, to identify LCRs in a protein sequence. In the graph constructed for the sequence, each vertex corresponds to a pair of similar amino acids, and each edge connects two pairs of amino acids that can be grouped together to form a longer repeat. GBA finds short subsequences as LCR candidates by traversing this graph. It then extends them to find longer subsequences that may contain full repeats with low complexities. The extended subsequences are then post-processed to refine repeats into LCRs. Our experiments on real data show that GBA achieves significantly higher recall than existing algorithms, including 0j.py, CARD, and SEG.

Availability: The program is available on request.
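The graph construction described in the abstract can be illustrated with a small sketch. This is not the authors' GBA implementation: the abstract does not state the similarity threshold, the maximum distance between paired residues, or the exact edge rule, so the sketch assumes that two residues are "similar" only when identical (a stand-in for a BLOSUM62 score cutoff), that pairs lie within a hypothetical `max_offset`, and that an edge links pair (i, j) to pair (i + 1, j + 1). The function names are hypothetical as well.

```python
def similar(a, b):
    # Assumption: identity stands in for a scoring-matrix cutoff
    # (e.g. a BLOSUM62 score above some threshold).
    return a == b


def build_pair_graph(seq, max_offset=10, is_similar=similar):
    """Build the pair graph under the assumptions stated above.

    Vertices are index pairs (i, j) with i < j <= i + max_offset whose
    residues are similar. An edge maps (i, j) to (i + 1, j + 1) when
    both are vertices, i.e. the two pairs extend the same repeat.
    """
    n = len(seq)
    vertices = set()
    for i in range(n):
        for j in range(i + 1, min(n, i + max_offset + 1)):
            if is_similar(seq[i], seq[j]):
                vertices.add((i, j))
    edges = {}
    for (i, j) in vertices:
        nxt = (i + 1, j + 1)
        if nxt in vertices:
            edges[(i, j)] = nxt
    return vertices, edges


def candidate_regions(seq, min_run=3, max_offset=10):
    """Walk each maximal chain of edges and report its residue span.

    Returns sorted, de-duplicated (start, end) index pairs (inclusive)
    for chains with at least min_run vertices.
    """
    vertices, edges = build_pair_graph(seq, max_offset)
    has_predecessor = set(edges.values())
    regions = set()
    for start in vertices - has_predecessor:
        v, run = start, 1
        while v in edges:
            v = edges[v]
            run += 1
        if run >= min_run:
            regions.add((start[0], v[1]))
    return sorted(regions)


if __name__ == "__main__":
    # The QA dipeptide repeat occupies indices 2..10 of this toy sequence.
    print(candidate_regions("MKQAQAQAQAQLT"))
```

Chains found this way correspond only to the short candidate subsequences mentioned in the abstract; the extension and post-processing steps, and the complexity measures themselves, are not specified there and are not modelled here.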

Similar resources

Detection and Discrimination of Theileria annulata and Theileria lestoquardi by using a Single PCR

The aim of this study was to detect and differentiate Theileria annulata and T. lestoquardi (hirci) by PCR. Members of the genus Theileria are tick-borne hemoprotozoan parasites that cause fatal and enervating diseases of cattle and sheep in Iran. In order to develop a specific method for detecting and identifying Theileria species, specific primers from the surface protein (SP) seque...

Estimating the location of protein-coding regions in numerical DNA sequences using a variable-length window based on the three-dimensional Z-curve

In recent years, estimation of protein-coding regions in numerical deoxyribonucleic acid (DNA) sequences using signal processing tools has been a challenging issue in bioinformatics, owing to their 3-base periodicity. Several digital signal processing (DSP) tools have been applied in order to identify the task and concentrated on assigning numerical values to the symbolic DNA sequence, then app...

A method for identifying software components based on Non-dominated Sorting Genetic Algorithm

Identifying the appropriate software components in the software design phase is a vital task in the field of software engineering and is considered an important way to increase software maintenance capability. Nowadays, many methods for identifying components, such as graph partitioning and clustering, have been presented, but most of these methods are based on expert opinion and have poor accur...

ON FUZZY NEIGHBORHOOD BASED CLUSTERING ALGORITHM WITH LOW COMPLEXITY

The main purpose of this paper is to achieve improvement in the speed of the Fuzzy Joint Points (FJP) algorithm. Since the FJP approach is a basis for fuzzy neighborhood based clustering algorithms such as Noise-Robust FJP (NRFJP) and Fuzzy Neighborhood DBSCAN (FN-DBSCAN), improving the FJP algorithm would be an important achievement in terms of these FJP-based methods. Although FJP has many advantages such as r...

A novel chimeric recombinant protein PDHB-P80 of Mycoplasma agalactiae as a potential diagnostic tool

The aim of this study was to construct and express a novel recombinant chimeric protein consisting of pyruvate dehydrogenase beta subunit (PDHB) and a highly antigenic region of the integral membrane lipoprotein P80 of Mycoplasma agalactiae as a potential diagnostic tool. The full-length sequence of pdhb and a portion of the antigenic regions of P80 were selected and analyzed by CLC ma...

Journal: Bioinformatics

Volume 22, Issue 24

Pages: -

Publication date: 2006
